Fixed page data truncation in parquet writer under certain conditions. #15474
…e dict_bits field in parquet page headers.
@vuule Need expert eyes here. I'm not an expert on the writer.
One additional question here: how might we add a test for this? Our reader handles it just fine. The only symptom is a crash in a third-party reader. Maybe a test isn't necessary if we can verify this fixes things for the external use case.
How about pandas? Can pandas handle such a situation like our reader does?
I'll give it a try. I'm guessing the answer is yes though, since we've been generating data like this for a long time now and it would almost certainly have shown up in the Python tests.
Spark integration tests and cudf parquet + compute-sanitizer tests passed.
Yeah, pandas handles the problem files just fine.
Seems like a bug in the reader that's failing. The page header will have the correct size if it leaves off the
A test could write a null column, then pull the header for the first data page and check that the Edit:
Agreed with @etseidl here: if Arrow writes the 0 in this case, we can do it too without impacting any other functionality.
/merge
Fixes #15473
The issue is that in some cases, for example where we have all nulls, we can fail to update the size of the page output buffer, resulting in a missing byte expected by some readers. Specifically, we poke the value of dict_bits into the output buffer here:
cudf/cpp/src/io/parquet/page_enc.cu
Line 1892 in 6319ab7
But if we have no leaf values (for example, because everything in the page is null), s->cur never gets updated here, because we never enter the containing loop:
cudf/cpp/src/io/parquet/page_enc.cu
Line 1948 in 6319ab7
The fix is to just always update s->cur after this if-else block:
cudf/cpp/src/io/parquet/page_enc.cu
Line 1891 in 6319ab7
Note that this was already handled by our reader. But some third party readers (Trino) are expecting that data to be there and crash if it's not.
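The failure mode described above can be illustrated with a small host-side sketch. This is not the code from page_enc.cu: `PageState`, `encode_page`, and the `always_advance` flag are hypothetical names made up for illustration, and the one-byte-per-value loop body is a placeholder for the real RLE/bit-packed encoding. It only models the write cursor, assuming the reported page size is taken from that cursor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the encoder's page state; only the write cursor is modeled.
struct PageState {
  std::vector<uint8_t> buf;  // page output buffer
  std::size_t cur = 0;       // number of bytes in buf that count as page data
};

// Encodes a dictionary-encoded page holding num_leaf_values non-null values and
// returns the page data size implied by the cursor.
// always_advance == false: the cursor only moves inside the value loop
//                          (the behavior described as the bug).
// always_advance == true:  the cursor moves right after dict_bits is written
//                          (the behavior described as the fix).
std::size_t encode_page(PageState& s, uint8_t dict_bits, std::size_t num_leaf_values,
                        bool always_advance)
{
  s.buf.assign(1 + num_leaf_values, 0);
  s.cur    = 0;
  s.buf[0] = dict_bits;  // poke the bit width into the output buffer
  if (always_advance) { s.cur = 1; }
  for (std::size_t i = 0; i < num_leaf_values; ++i) {
    // Placeholder payload: one byte per value instead of real encoded output.
    s.buf[1 + i] = static_cast<uint8_t>(i);
    s.cur        = 2 + i;  // dict_bits byte plus the values written so far
  }
  return s.cur;
}
```

With zero leaf values (an all-null page), the non-advancing variant reports a size of 0 even though the dict_bits byte was written, while the advancing variant reports 1. For pages with at least one leaf value, both variants report the same size, which is consistent with the problem only showing up under certain conditions.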